Developing a Stratified Sampling Scheme for the Unmonitored
Systems Using K-means Clustering Method
Background
The aim of this analysis is to analyze SMP population using CWL
monitoring data, categorize it, and identify underrepresented SMP
features in our data set. We will also use clustering to group systems
with similar characteristics and develop a stratified sampling script to
randomly select a representative sample of systems proportional to their
cluster and system type. This will ensure a comprehensive data set
covering all types of SMPs with varying design metrics such as loading
ratios, drainage area, and storm size management. The output of this
analysis will help the data collection team with their future site
selection.
- Finding the over-represented/under-represented
SMPs
#Libraries
library(odbc)
library(DBI)
library(lubridate)
library(tidyverse)
library(stats)
library(gridExtra)
library(grid)
library(gtable)
library(ggtext)
library(dplyr)
library(ggplot2)
library(knitr)
library(plotly)
library(cluster)
library(factoextra)
# DB PG14
con <- dbConnect(odbc::odbc(), dsn = "mars14_data", uid = Sys.getenv("shiny_uid"), pwd = Sys.getenv("shiny_pwd"), MaxLongVarcharSize = 8190)
#Getting list of SMPs
cwl_smp <- dbGetQuery(con,"SELECT DISTINCT smp_id
FROM fieldwork.viw_deployment_full_cwl")
#Getting the greenit info
smpbdv <- dbGetQuery(con,"SELECT * FROM external.tbl_smpbdv")
sysbdv <- dbGetQuery(con,"SELECT * FROM external.tbl_systembdv")
greenit_unified <-dbGetQuery(con,"SELECT * FROM external.viw_greenit_unified")
# join the greenit table with our smp list
smp_features <- cwl_smp %>%
inner_join(smpbdv, by="smp_id")
# Unmonitored SMPs
unmonitored_smp <- read.csv("C:\\Users\\Farshad.Ebrahimi\\OneDrive - City of Philadelphia\\Github Projects\\smp-population-assesment\\unmonitored_sites.csv") %>%
select(smp_id = SMP.ID) %>%
distinct()
# unmonitored smp_features
unmonitored_smp_features <- unmonitored_smp %>%
inner_join(smpbdv, by="smp_id")
The code calculates the distribution of SMP types with CWL data in
the MARS database, providing an initial understanding of the
distribution of SMP subtypes.
#number of smp types
smp_type_break <- smp_features %>%
group_by(smp_smptype) %>%
summarise(count = n()) %>%
select(Type = smp_smptype, count)

#number of unmonitored smps
unmonitored_smp_type_break <- unmonitored_smp_features %>%
group_by(smp_smptype) %>%
summarise(count = n()) %>%
select(Type = smp_smptype, count)
Similar analysis is performed below to calculate the distribution of
SMP types in the entire population of SMP including the monitored and
unmonitored SMPs.

The analysis aims to determine the over-represented (normalized
percent > 1) and under-represented (normalized percent < 1) SMP
types by dividing the fraction of a specific SMP type in the monitored
data by the fraction in the entire population.

#combining all smps (monitored and unmonitored) and get fractions based on type of smp-finally normalize them
total_smp <- unmonitored_smp_type_break %>%
full_join(smp_type_break, by="Type")
total_smp[is.na(total_smp)] <- 0
total_smp <- total_smp %>%
mutate(total_number = count.x+count.y) %>%
select(Type, total_number)
total_smp <- total_smp %>%
mutate(sum_all = sum(total_number)) %>%
mutate(percent_all = (total_number/sum_all)*100)
#get the percent of monitored SMP
smp_type_break_fraction <- smp_type_break %>%
mutate(sum_all = sum(count)) %>%
mutate(percent_monitored = (count/sum_all)*100)
#normalize the monitored smp by the total smp percents
normalized_fractions <- smp_type_break_fraction %>%
full_join(total_smp, by="Type") %>%
mutate(normalized_percent = percent_monitored/percent_all) %>%
select(Type, percent_monitored, percent_all, normalized_percent)
normalized_fractions[is.na(normalized_fractions)] <- 0
#calculating the normalized percentage of smp types
kable(arrange(normalized_fractions, normalized_percent), caption = "Normalized SMP Break-down")
Normalized SMP Break-down
| Green Roof |
0.0000000 |
0.1490313 |
0.0000000 |
| Stormwater Tree |
0.0000000 |
4.1728763 |
0.0000000 |
| Bumpout |
2.8017241 |
8.5692996 |
0.3269490 |
| Swale |
1.2931034 |
2.5335320 |
0.5103955 |
| Planter |
5.8189655 |
7.5260805 |
0.7731734 |
| Infiltration/Storage Trench |
25.8620690 |
29.4336811 |
0.8786556 |
| Basin |
0.2155172 |
0.2235469 |
0.9640805 |
| Rain Garden |
12.0689655 |
11.1028316 |
1.0870169 |
| Tree Trench |
50.6465517 |
35.6929955 |
1.4189493 |
| Pervious Paving |
0.4310345 |
0.2980626 |
1.4461207 |
| Drainage Well |
0.8620690 |
0.2980626 |
2.8922414 |
The table above shows Tree trenches are over-represented in our CWL
data, while bumpout, swale and planters are under-represented (due to
monitoring limitation, etc). However, it should be noted that MARS does
not plan on monitoring greef roofs due to limitations on access to these
systems and a 1:1 loading ratios. Stromwater trees, smilarly, does not
have any CWL data and the monitoring efforts are limited to CET tests
only. Porous pavements are also limited to infiltration rate testing and
have no CWL data.
- Clustering at the system-level
The current section aims to categorize the systems within each broad
subtypes. The reason to perform this at the system level is the
accessibility of design metrics at the system level. The clustering
within each type of systems is done to ensure the sampling approach
picks an appropriate number of systems from systems possessing similar
characteristics within these types. For example, the sampling method
should select a larger number of Bioinfiltration systems with higher
drainage area if those systems constitute the majority of
Bioinfiltration systems.
The binning approach considers several system features like loading
ratios, drainage area, and storm size management. The code chunks below
demonstrate a clustering method applied to unmonitored systems,
regardless of subtypes, as an example of how it can group systems based
solely on these three design metrics. The clustering will then be
repeated within each subtype to provide a more relevant method for
sampling.
#rawstormsizemanaged-in , sys_impervda_ft2, and loading ratio metrics gathering
table_all <- unmonitored_smp_features %>%
left_join(sysbdv, by="system_id") %>%
mutate(loading_ratio = sys_lrtotalda_ft2 )
K-means clustering is an unsupervised learning method that groups
data points based on their proximity to centroids, which are calculated
as the mean of all points in a cluster. The algorithm starts by randomly
selecting initial centroids, and then repeatedly adjusts them until
convergence is achieved
The optimal number of clusters is determined using the Elbow method.
In summary, this method picks the number of clusters in which a the
slope of the lines changes more drastically (cluster 4 in the following
graph).
#Clustring analysis based on 3 features of drainage area, loading ratio, and storm sized managed
clustered_moniored <- table_all %>%
select(system_id, sys_impervda_ft2, loading_ratio, sys_rawstormsizemanaged_in) %>%
na.omit() %>%
distinct()
#normalze using mean-sd standardization
normalized_cluster <- clustered_moniored
normalized_cluster[,2:4] <- scale(clustered_moniored[,2:4])
#Elbow method shows a subtle elbow at cluster 4-hence 4 clusters is optimum
#number of clusters
fviz_nbclust(normalized_cluster[,2:4], kmeans, method = "wss")

#k-means
# Cluster using kmeans with five clusters
cluster_solution <- kmeans(normalized_cluster[,2:4], centers = 4)
# Store the cluster assignments back into the clustering data frame object
clustered_moniored$cluster <-factor(cluster_solution$cluster)
# Look at the distribution of cluster assignments
table(clustered_moniored$cluster)
1 2 3 4
21 279 136 11
The above table displays the number of systems in each cluster. The
subsequent tables aim to determine the average of system design metrics
in each category. The data can be analyzed by examining each category;
for instance, Cluster Four encompasses systems with a large drainage
area, capable of handling larger storm sizes (>3 inch/hr).
# Group by the cluster assignment and calculate averages
sys_clus_avg <- clustered_moniored %>%
group_by(cluster) %>%
summarize_if(is.numeric, mean)
# View the resulting table
kable(sys_clus_avg)
| 1 |
17695.05 |
68.36332 |
1.491118 |
| 2 |
20363.52 |
12.75067 |
1.886491 |
| 3 |
17350.55 |
23.52892 |
1.212609 |
| 4 |
62204.82 |
10.56821 |
3.843215 |
ggplot(data = clustered_moniored, aes(x = loading_ratio, y = sys_rawstormsizemanaged_in, color = cluster)) +
geom_point()

plot_ly(clustered_moniored, x = ~sys_impervda_ft2, y = ~loading_ratio, z = ~sys_rawstormsizemanaged_in) %>%
add_markers(color = ~cluster)
In order to create a more pertinent clustering method that
distinguishes between general types of systems (lined vs. unlined or
bioretention vs. bioinfiltration), the unmonitored systems were divided
into broad categories, including Bioinfiltration, Bioretention (lined),
Bioretention (unlined), Subsurface Infiltration, Subsurface Slow Release
(lined), and Subsurface Slow Release (unlined) using the system-level
feature called sys_modelinputcategory. The code chunk below displays a
breakdown of these SMPs. The clustering algorithm was repeated for each
of those groups similarly. Please note that Green Roofs and Permeable
Pavements were not included due to a low number of systems.
#sys_modelinputcategory categories
sys_cat <- table_all %>%
select(system_id, sys_modelinputcategory)%>%
distinct %>%
group_by(sys_modelinputcategory) %>%
summarise(count = n())
kable(sys_cat)
| Bioinfiltration |
124 |
| Bioretention (lined) |
29 |
| Bioretention (unlined) |
83 |
| Green Roof |
2 |
| Permeable Pavement |
1 |
| Subsurface infiltration |
159 |
| Subsurface slow release (lined) |
113 |
| Subsurface slow release (unlined) |
113 |
-Bioinfiltration Clustering:
clustered_Bioinfiltration <- table_all %>%
filter(sys_modelinputcategory == "Bioinfiltration") %>%
select(system_id, sys_impervda_ft2, loading_ratio, sys_rawstormsizemanaged_in, sys_modelinputcategory) %>%
na.omit() %>%
distinct()
#normalze using mean-sd standardization
normalized_cluster <- clustered_Bioinfiltration
normalized_cluster[,2:4] <- scale(clustered_Bioinfiltration[,2:4])
#Elbow method shows a subtle elbow at cluster 4-hence 4 clusters is optimum
#number of clusters
fviz_nbclust(normalized_cluster[,2:4], kmeans, method = "wss")

#k-means
# Cluster using kmeans with five clusters
cluster_solution <- kmeans(normalized_cluster[,2:4], centers = 4)
# Store the cluster assignments back into the clustering data frame object
clustered_Bioinfiltration$cluster <-factor(cluster_solution$cluster)
# Look at the distribution of cluster assignments
table(clustered_Bioinfiltration$cluster)
1 2 3 4
17 31 72 4
# Group by the cluster assignment and calculate averages
sys_clus_avg <- clustered_Bioinfiltration %>%
group_by(cluster) %>%
summarize_if(is.numeric, mean)
# View the resulting table
kable(sys_clus_avg)
| 1 |
11184.24 |
40.58019 |
1.063568 |
| 2 |
51515.00 |
15.79839 |
1.752800 |
| 3 |
15718.52 |
12.39893 |
1.778769 |
| 4 |
22358.50 |
10.18006 |
4.348037 |
#3 D plot
plot_ly(clustered_Bioinfiltration, x = ~sys_impervda_ft2, y = ~loading_ratio, z = ~sys_rawstormsizemanaged_in) %>%
add_markers(color = ~cluster)
-Bioretention (lined) clustering. Note that loading ratio
field is not applicable here:
clustered_Bioreten_lined <- table_all %>%
filter(sys_modelinputcategory == "Bioretention (lined)") %>%
select(system_id, sys_impervda_ft2, sys_rawstormsizemanaged_in, sys_modelinputcategory) %>%
na.omit() %>%
distinct()
#normalze using mean-sd standardization
normalized_cluster <- clustered_Bioreten_lined
normalized_cluster[,2:3] <- scale(clustered_Bioreten_lined[,2:3])
#Elbow method shows a subtle elbow at cluster 4-hence 4 clusters is optimum
#number of clusters
fviz_nbclust(normalized_cluster[,2:3], kmeans, method = "wss")

#k-means
# Cluster using kmeans with five clusters
cluster_solution <- kmeans(normalized_cluster[,2:3], centers = 2)
# Store the cluster assignments back into the clustering data frame object
clustered_Bioreten_lined$cluster <-factor(cluster_solution$cluster)
# Look at the distribution of cluster assignments
table(clustered_Bioreten_lined$cluster)
1 2
5 24
# Group by the cluster assignment and calculate averages
sys_clus_avg <- clustered_Bioreten_lined %>%
group_by(cluster) %>%
summarize_if(is.numeric, mean)
# View the resulting table
kable(sys_clus_avg)
| 1 |
73919.40 |
2.270016 |
| 2 |
18291.83 |
1.473926 |
#3 D plot
plot_ly(clustered_Bioreten_lined, x = ~sys_impervda_ft2, y = ~sys_rawstormsizemanaged_in) %>%
add_markers(color = ~cluster)
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
-Bioretention (unlined) clustering:
clustered_Bioreten_unlined <- table_all %>%
filter(sys_modelinputcategory == "Bioretention (unlined)") %>%
select(system_id, sys_impervda_ft2, sys_rawstormsizemanaged_in, loading_ratio, sys_modelinputcategory) %>%
na.omit() %>%
distinct()
#normalze using mean-sd standardization
normalized_cluster <- clustered_Bioreten_unlined
normalized_cluster[,2:4] <- scale(clustered_Bioreten_unlined[,2:4])
#Elbow method shows a subtle elbow at cluster 4-hence 4 clusters is optimum
#number of clusters
fviz_nbclust(normalized_cluster[,2:4], kmeans, method = "wss")

#k-means
# Cluster using kmeans with five clusters
cluster_solution <- kmeans(normalized_cluster[,2:4], centers = 3)
# Store the cluster assignments back into the clustering data frame object
clustered_Bioreten_unlined$cluster <-factor(cluster_solution$cluster)
# Look at the distribution of cluster assignments
table(clustered_Bioreten_unlined$cluster)
1 2 3
9 21 53
# Group by the cluster assignment and calculate averages
sys_clus_avg <- clustered_Bioreten_unlined %>%
group_by(cluster) %>%
summarize_if(is.numeric, mean)
# View the resulting table
kable(sys_clus_avg)
| 1 |
7046.333 |
3.309078 |
4.263439 |
| 2 |
42812.095 |
1.525561 |
20.624402 |
| 3 |
16789.575 |
1.639162 |
12.959101 |
#3 D plot
plot_ly(clustered_Bioreten_unlined, x = ~sys_impervda_ft2, y = ~sys_rawstormsizemanaged_in, z= ~loading_ratio) %>%
add_markers(color = ~cluster)
-Subsurface infiltration clustering:
clustered_subsurface <- table_all %>%
filter(sys_modelinputcategory == "Subsurface infiltration") %>%
select(system_id, sys_impervda_ft2, sys_rawstormsizemanaged_in, loading_ratio, sys_modelinputcategory) %>%
na.omit() %>%
distinct()
#normalze using mean-sd standardization
normalized_cluster <- clustered_subsurface
normalized_cluster[,2:4] <- scale(clustered_subsurface[,2:4])
#Elbow method shows a subtle elbow at cluster 4-hence 4 clusters is optimum
#number of clusters
fviz_nbclust(normalized_cluster[,2:4], kmeans, method = "wss")

#k-means
# Cluster using kmeans with five clusters
cluster_solution <- kmeans(normalized_cluster[,2:4], centers = 2)
# Store the cluster assignments back into the clustering data frame object
clustered_subsurface$cluster <-factor(cluster_solution$cluster)
# Look at the distribution of cluster assignments
table(clustered_subsurface$cluster)
1 2
99 26
# Group by the cluster assignment and calculate averages
sys_clus_avg <- clustered_subsurface %>%
group_by(cluster) %>%
summarize_if(is.numeric, mean)
# View the resulting table
kable(sys_clus_avg)
| 1 |
19525.382 |
1.837962 |
12.69806 |
| 2 |
7850.692 |
1.030724 |
31.54683 |
#3 D plot
plot_ly(clustered_subsurface, x = ~sys_impervda_ft2, y = ~sys_rawstormsizemanaged_in, z= ~loading_ratio) %>%
add_markers(color = ~cluster)
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
-Subsurface slow release (lined) clustering:
clustered_slowrel_lined <- table_all %>%
filter(sys_modelinputcategory == "Subsurface slow release (lined)") %>%
select(system_id, sys_impervda_ft2, sys_rawstormsizemanaged_in, sys_modelinputcategory) %>%
na.omit() %>%
distinct()
#normalze using mean-sd standardization
normalized_cluster <- clustered_slowrel_lined
normalized_cluster[,2:3] <- scale(clustered_slowrel_lined[,2:3])
#Elbow method shows a subtle elbow at cluster 4-hence 4 clusters is optimum
#number of clusters
fviz_nbclust(normalized_cluster[,2:3], kmeans, method = "wss")

#k-means
# Cluster using kmeans with five clusters
cluster_solution <- kmeans(normalized_cluster[,2:3], centers = 2)
# Store the cluster assignments back into the clustering data frame object
clustered_slowrel_lined$cluster <-factor(cluster_solution$cluster)
# Look at the distribution of cluster assignments
table(clustered_slowrel_lined$cluster)
1 2
58 55
# Group by the cluster assignment and calculate averages
sys_clus_avg <- clustered_slowrel_lined %>%
group_by(cluster) %>%
summarize_if(is.numeric, mean)
# View the resulting table
kable(sys_clus_avg)
| 1 |
13139.92 |
1.195020 |
| 2 |
20312.98 |
1.720648 |
#3 D plot
plot_ly(clustered_slowrel_lined, x = ~sys_impervda_ft2, y = ~sys_rawstormsizemanaged_in) %>%
add_markers(color = ~cluster)
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
-Subsurface slow release (unlined) clustering:
clustered_slowrel_unlined <- table_all %>%
filter(sys_modelinputcategory == "Subsurface slow release (unlined)") %>%
select(system_id, sys_impervda_ft2, sys_rawstormsizemanaged_in, loading_ratio, sys_modelinputcategory) %>%
na.omit() %>%
distinct()
#normalze using mean-sd standardization
normalized_cluster <- clustered_slowrel_unlined
normalized_cluster[,2:3] <- scale(clustered_slowrel_unlined[,2:3])
#Elbow method shows a subtle elbow at cluster 4-hence 4 clusters is optimum
#number of clusters
fviz_nbclust(normalized_cluster[,2:3], kmeans, method = "wss")

#k-means
# Cluster using kmeans with five clusters
cluster_solution <- kmeans(normalized_cluster[,2:3], centers = 2)
# Store the cluster assignments back into the clustering data frame object
clustered_slowrel_unlined$cluster <-factor(cluster_solution$cluster)
# Look at the distribution of cluster assignments
table(clustered_slowrel_unlined$cluster)
1 2
61 52
# Group by the cluster assignment and calculate averages
sys_clus_avg <- clustered_slowrel_unlined %>%
group_by(cluster) %>%
summarize_if(is.numeric, mean)
# View the resulting table
kable(sys_clus_avg)
| 1 |
18886.52 |
1.910538 |
19.26955 |
| 2 |
16819.60 |
1.337858 |
34.30980 |
#3 D plot
plot_ly(clustered_slowrel_unlined, x = ~sys_impervda_ft2, y = ~sys_rawstormsizemanaged_in, z = ~loading_ratio) %>%
add_markers(color = ~cluster)
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
Upon clustering the systems, the data will be collected in a unified
data frame for stratified sampling. The stratified sampling creates a
balanced list of systems that are reflective of the size of each system
type and that of the internal clusters. For example, the code below has
generated 58 system (~ 10% of total unmonitored systems); the size of
each system type is proportional to that of the population, and within
each type of systems, a proportional number of systems has been chosen
from each cluster e.g., in Bioinfiltration type systems, 7 systems are
randomly chosen from cluster 3, 3 systems from cluster 2, and 2 systems
from cluster 1.
#binding all sub-type groups into one table
sys_clustered_all <- bind_rows(clustered_Bioinfiltration[,c("system_id","cluster","sys_modelinputcategory")], clustered_Bioreten_lined[,c("system_id","cluster","sys_modelinputcategory")], clustered_Bioreten_unlined[,c("system_id","cluster","sys_modelinputcategory")], clustered_slowrel_lined[,c("system_id","cluster","sys_modelinputcategory")], clustered_slowrel_unlined[,c("system_id","cluster","sys_modelinputcategory")], clustered_subsurface[,c("system_id","cluster","sys_modelinputcategory")])
#sampling code
stratified_sample <- sys_clustered_all %>%
group_by(sys_modelinputcategory, cluster) %>%
sample_frac(size=0.1)
kable(stratified_sample)
| 1023-11 |
1 |
Bioinfiltration |
| 443-1 |
1 |
Bioinfiltration |
| 1083-6 |
2 |
Bioinfiltration |
| 1127-1 |
2 |
Bioinfiltration |
| 1195-3 |
2 |
Bioinfiltration |
| 991-1 |
3 |
Bioinfiltration |
| 310-6 |
3 |
Bioinfiltration |
| 410-2 |
3 |
Bioinfiltration |
| 1315-1 |
3 |
Bioinfiltration |
| 1140-2 |
3 |
Bioinfiltration |
| 244-1 |
3 |
Bioinfiltration |
| 416-1 |
3 |
Bioinfiltration |
| 1007-2 |
2 |
Bioretention (lined) |
| 1067-4 |
2 |
Bioretention (lined) |
| 1178-6 |
1 |
Bioretention (unlined) |
| 1273-6 |
2 |
Bioretention (unlined) |
| 578-2 |
2 |
Bioretention (unlined) |
| 595-4 |
3 |
Bioretention (unlined) |
| 596-4 |
3 |
Bioretention (unlined) |
| 1265-3 |
3 |
Bioretention (unlined) |
| 1200-3 |
3 |
Bioretention (unlined) |
| 588-2 |
3 |
Bioretention (unlined) |
| 443-3 |
1 |
Subsurface infiltration |
| 1051-15 |
1 |
Subsurface infiltration |
| 280-3 |
1 |
Subsurface infiltration |
| 344-1 |
1 |
Subsurface infiltration |
| 1070-6 |
1 |
Subsurface infiltration |
| 1281-27 |
1 |
Subsurface infiltration |
| 637-1 |
1 |
Subsurface infiltration |
| 1010-5 |
1 |
Subsurface infiltration |
| 1242-1 |
1 |
Subsurface infiltration |
| 1138-1 |
1 |
Subsurface infiltration |
| 441-13 |
2 |
Subsurface infiltration |
| 1298-1 |
2 |
Subsurface infiltration |
| 441-12 |
2 |
Subsurface infiltration |
| 1059-3 |
1 |
Subsurface slow release (lined) |
| 1240-2 |
1 |
Subsurface slow release (lined) |
| 1202-2 |
1 |
Subsurface slow release (lined) |
| 525-1 |
1 |
Subsurface slow release (lined) |
| 554-7 |
1 |
Subsurface slow release (lined) |
| 994-3 |
1 |
Subsurface slow release (lined) |
| 634-1 |
2 |
Subsurface slow release (lined) |
| 1011-3 |
2 |
Subsurface slow release (lined) |
| 1267-2 |
2 |
Subsurface slow release (lined) |
| 1206-2 |
2 |
Subsurface slow release (lined) |
| 1129-4 |
2 |
Subsurface slow release (lined) |
| 1287-6 |
2 |
Subsurface slow release (lined) |
| 1051-17 |
1 |
Subsurface slow release (unlined) |
| 595-10 |
1 |
Subsurface slow release (unlined) |
| 1209-6 |
1 |
Subsurface slow release (unlined) |
| 1275-2 |
1 |
Subsurface slow release (unlined) |
| 1062-5 |
1 |
Subsurface slow release (unlined) |
| 1137-6 |
1 |
Subsurface slow release (unlined) |
| 1070-11 |
2 |
Subsurface slow release (unlined) |
| 1265-11 |
2 |
Subsurface slow release (unlined) |
| 989-1 |
2 |
Subsurface slow release (unlined) |
| 1359-2 |
2 |
Subsurface slow release (unlined) |
| 1051-3 |
2 |
Subsurface slow release (unlined) |
To generate monitoring systems, we developed a semi-random approach
that utilizes a binning technique. This approach involves assigning
systems to clusters based on their specific type, with the binning
process utilizing a combination of design metrics to group similar
systems together. To ensure a representative sample of systems, the
random selection process chooses a proportional number of systems from
each cluster and system type based on their size.
---
title: "SMP Population Assesment"
output: html_notebook
author: "Farshad Ebrahimi, Feb/7/2023"
---

**Developing a Stratified Sampling Scheme for the Unmonitored Systems Using K-means Clustering Method**

**Background**

The aim of this analysis is to analyze SMP population using CWL monitoring data, categorize it, and identify underrepresented SMP features in our data set. We will also use clustering to group systems with similar characteristics and develop a stratified sampling script to randomly select a representative sample of systems proportional to their cluster and system type. This will ensure a comprehensive data set covering all types of SMPs with varying design metrics such as loading ratios, drainage area, and storm size management. The output of this analysis will help the data collection team with their future site selection.

1.  **Finding the over-represented/under-represented SMPs**

```{r section 1, Loading libraries and querying smp data }
#Libraries
      library(odbc)
      library(DBI)
      library(lubridate)
      library(tidyverse)
      library(stats)
      library(gridExtra)
      library(grid)
      library(gtable)
      library(ggtext)
      library(dplyr)
      library(ggplot2)
      library(knitr)
      library(plotly)
      library(cluster)
      library(factoextra)
# DB PG14
    con <- dbConnect(odbc::odbc(), dsn = "mars14_data", uid = Sys.getenv("shiny_uid"), pwd = Sys.getenv("shiny_pwd"), MaxLongVarcharSize = 8190)

#Getting list of SMPs
    cwl_smp <- dbGetQuery(con,"SELECT DISTINCT smp_id
           FROM fieldwork.viw_deployment_full_cwl")
    
#Getting the greenit info
    smpbdv <- dbGetQuery(con,"SELECT * FROM external.tbl_smpbdv")
    sysbdv <- dbGetQuery(con,"SELECT * FROM external.tbl_systembdv")
    greenit_unified <-dbGetQuery(con,"SELECT * FROM external.viw_greenit_unified")

# join the greenit table with our smp list
    smp_features <- cwl_smp %>% 
      inner_join(smpbdv, by="smp_id")
# Unmonitored SMPs
    unmonitored_smp <- read.csv("C:\\Users\\Farshad.Ebrahimi\\OneDrive - City of Philadelphia\\Github Projects\\smp-population-assesment\\unmonitored_sites.csv") %>%
      select(smp_id = SMP.ID) %>%
      distinct()
# unmonitored smp_features
    unmonitored_smp_features <- unmonitored_smp %>%
      inner_join(smpbdv, by="smp_id")
```

The code calculates the distribution of SMP types with CWL data in the MARS database, providing an initial understanding of the distribution of SMP subtypes.

```{r section 2, Monitored SMP break-down based on type}
#number of smp types
smp_type_break <- smp_features %>% 
  group_by(smp_smptype) %>%
  summarise(count = n()) %>%
  select(Type = smp_smptype, count)
```

```{r section 3 pie chart of monitored SMPs, echo=False}
ggplot(smp_type_break, aes(x="", y= count, fill=Type))+geom_bar(width = 1, stat = "identity")+coord_polar("y", start=0)
```

```{r section 4, looking at the unmonitored SMPs to see the break-down}
#number of unmonitored smps
unmonitored_smp_type_break <- unmonitored_smp_features %>% 
  group_by(smp_smptype) %>%
  summarise(count = n()) %>%
  select(Type = smp_smptype, count)
```

Similar analysis is performed below to calculate the distribution of SMP types in the entire population of SMP including the monitored and unmonitored SMPs.

```{r setion 5 pie chart of unmonitored SMPs, echo=FALSE}
ggplot(unmonitored_smp_type_break, aes(x="", y= count, fill=Type))+geom_bar(width = 1, stat = "identity")+coord_polar("y", start=0)
```

The analysis aims to determine the over-represented (normalized percent \> 1) and under-represented (normalized percent \< 1) SMP types by dividing the fraction of a specific SMP type in the monitored data by the fraction in the entire population.

```{r section 6 full join the unmonitored and monitored, echo=FALSE}
#full join the unmonitored and monirored and create a column to sum them up
smp_breakdwon_full <- smp_type_break %>%
  full_join(unmonitored_smp_type_break, by="Type") %>%
  select(Type, count_monitored = count.x, count_unmonitored = count.y)

#replace na with zero
smp_breakdwon_full[is.na(smp_breakdwon_full)] <- 0

#mutate the third column
smp_breakdwon_full <- smp_breakdwon_full %>%
  mutate(sum_all = count_monitored + count_unmonitored)

#data frame for plotting (only two columns)
pivot_df <- smp_breakdwon_full %>%
  select(Type, count_monitored, sum_all)

df_longer <- pivot_longer(pivot_df, cols = c("count_monitored", "sum_all"), names_to = "category", values_to = "count")

#Plot
ggplot(df_longer,aes(x = Type, y = count, fill= category)) +
      geom_bar(stat="identity", position=position_dodge())+
      theme(axis.text=element_text(size=6, angle = 35),
        axis.title=element_text(size=10))

```

```{r section 7, adding the monitored and unmoniored}
#combining all smps (monitored and unmonitored) and get fractions based on type of smp-finally normalize them
total_smp <- unmonitored_smp_type_break %>%
  full_join(smp_type_break, by="Type")
total_smp[is.na(total_smp)] <- 0
total_smp <- total_smp %>%
  mutate(total_number = count.x+count.y) %>%
  select(Type, total_number)
total_smp <- total_smp %>%
    mutate(sum_all = sum(total_number)) %>%
    mutate(percent_all = (total_number/sum_all)*100)

#get the percent of monitored SMP
smp_type_break_fraction <- smp_type_break %>%
    mutate(sum_all = sum(count)) %>%
    mutate(percent_monitored = (count/sum_all)*100)

#normalize the monitored smp by the total smp percents
normalized_fractions <- smp_type_break_fraction %>%
  full_join(total_smp, by="Type") %>%
  mutate(normalized_percent = percent_monitored/percent_all) %>%
  select(Type, percent_monitored, percent_all, normalized_percent)

normalized_fractions[is.na(normalized_fractions)] <- 0
```

```{r}
#calculating the normalized percentage of smp types
kable(arrange(normalized_fractions, normalized_percent), caption = "Normalized SMP Break-down")
```

The table above shows Tree trenches are over-represented in our CWL data, while bumpout, swale and planters are under-represented (due to monitoring limitation, etc). However, it should be noted that MARS does not plan on monitoring greef roofs due to limitations on access to these systems and a 1:1 loading ratios. Stromwater trees, smilarly, does not have any CWL data and the monitoring efforts are limited to CET tests only. Porous pavements are also limited to infiltration rate testing and have no CWL data.

2.  **Clustering at the system-level**

The current section aims to categorize the systems within each broad subtypes. The reason to perform this at the system level is the accessibility of design metrics at the system level. The clustering within each type of systems is done to ensure the sampling approach picks an appropriate number of systems from systems possessing similar characteristics within these types. For example, the sampling method should select a larger number of Bioinfiltration systems with higher drainage area if those systems constitute the majority of Bioinfiltration systems.

The binning approach considers several system features like loading ratios, drainage area, and storm size management. The code chunks below demonstrate a clustering method applied to unmonitored systems, regardless of subtypes, as an example of how it can group systems based solely on these three design metrics. The clustering will then be repeated within each subtype to provide a more relevant method for sampling.

```{r section 8, rawstormsizemanaged-in , sys_impervda_ft2, and loading ratio metrics gathering}
#rawstormsizemanaged-in , sys_impervda_ft2, and loading ratio metrics gathering

table_all <- unmonitored_smp_features %>%
  left_join(sysbdv, by="system_id") %>%
  mutate(loading_ratio = sys_lrtotalda_ft2 )
```

K-means clustering is an unsupervised learning method that groups data points based on their proximity to centroids, which are calculated as the mean of all points in a cluster. The algorithm starts by randomly selecting initial centroids, and then repeatedly adjusts them until convergence is achieved

The optimal number of clusters is determined using the Elbow method. In summary, this method picks the number of clusters in which a the slope of the lines changes more drastically (cluster 4 in the following graph).

```{r section 9}
#Clustring analysis based on 3 features of drainage area, loading ratio, and storm sized managed

clustered_moniored <- table_all %>%
  select(system_id, sys_impervda_ft2, loading_ratio, sys_rawstormsizemanaged_in) %>%
  na.omit() %>%
  distinct()

#normalze using mean-sd standardization
normalized_cluster <- clustered_moniored
normalized_cluster[,2:4] <- scale(clustered_moniored[,2:4])

#Elbow method shows a subtle elbow at cluster 4-hence 4 clusters is optimum
#number of clusters
fviz_nbclust(normalized_cluster[,2:4], kmeans, method = "wss")

#k-means 
# Cluster using kmeans with five clusters
cluster_solution <- kmeans(normalized_cluster[,2:4], centers = 4)

# Store the cluster assignments back into the clustering data frame object
clustered_moniored$cluster <-factor(cluster_solution$cluster) 

# Look at the distribution of cluster assignments
table(clustered_moniored$cluster)
```

The above table displays the number of systems in each cluster. The subsequent tables aim to determine the average of system design metrics in each category. The data can be analyzed by examining each category; for instance, Cluster Four encompasses systems with a large drainage area, capable of handling larger storm sizes (\>3 inch/hr).

```{r section 10 analyzing the cluster features}
# Group by the cluster assignment and calculate averages
sys_clus_avg <- clustered_moniored %>%
    group_by(cluster) %>%
    summarize_if(is.numeric, mean)

# View the resulting table
kable(sys_clus_avg)
```

```{r 2D Plot}
ggplot(data = clustered_moniored, aes(x = loading_ratio, y = sys_rawstormsizemanaged_in, color = cluster)) +
    geom_point()
```

```{r 3D plot}
plot_ly(clustered_moniored, x = ~sys_impervda_ft2, y = ~loading_ratio, z = ~sys_rawstormsizemanaged_in) %>%
  add_markers(color = ~cluster)
```

In order to create a more pertinent clustering method that distinguishes between general types of systems (lined vs. unlined or bioretention vs. bioinfiltration), the unmonitored systems were divided into broad categories, including Bioinfiltration, Bioretention (lined), Bioretention (unlined), Subsurface Infiltration, Subsurface Slow Release (lined), and Subsurface Slow Release (unlined) using the system-level feature called sys_modelinputcategory. The code chunk below displays a breakdown of these SMPs. The clustering algorithm was repeated for each of those groups similarly. Please note that Green Roofs and Permeable Pavements were not included due to a low number of systems.

```{r section 11 sys_modelinputcategory is used to categorize systems}
#sys_modelinputcategory categories
sys_cat <- table_all %>%
  select(system_id, sys_modelinputcategory)%>%
  distinct %>%
  group_by(sys_modelinputcategory) %>%
  summarise(count = n())

kable(sys_cat)
```

**-Bioinfiltration Clustering:**

```{r section 12 clustering Bioinfiltration type systems }

clustered_Bioinfiltration <- table_all %>%
  filter(sys_modelinputcategory == "Bioinfiltration") %>%
  select(system_id, sys_impervda_ft2, loading_ratio, sys_rawstormsizemanaged_in, sys_modelinputcategory) %>%
  na.omit() %>%
  distinct()

#normalze using mean-sd standardization
normalized_cluster <- clustered_Bioinfiltration
normalized_cluster[,2:4] <- scale(clustered_Bioinfiltration[,2:4])

#Elbow method shows a subtle elbow at cluster 4-hence 4 clusters is optimum
#number of clusters
fviz_nbclust(normalized_cluster[,2:4], kmeans, method = "wss")

#k-means 
# Cluster using kmeans with five clusters
cluster_solution <- kmeans(normalized_cluster[,2:4], centers = 4)

# Store the cluster assignments back into the clustering data frame object
clustered_Bioinfiltration$cluster <-factor(cluster_solution$cluster) 

# Look at the distribution of cluster assignments
table(clustered_Bioinfiltration$cluster)

# Group by the cluster assignment and calculate averages
sys_clus_avg <- clustered_Bioinfiltration %>%
    group_by(cluster) %>%
    summarize_if(is.numeric, mean)

# View the resulting table
kable(sys_clus_avg)

#3 D plot
plot_ly(clustered_Bioinfiltration, x = ~sys_impervda_ft2, y = ~loading_ratio, z = ~sys_rawstormsizemanaged_in) %>%
  add_markers(color = ~cluster)
```

**-Bioretention (lined) clustering. Note that loading ratio field is not applicable here:**

```{r section 13 Bioretention lined clustering}

clustered_Bioreten_lined <- table_all %>%
  filter(sys_modelinputcategory == "Bioretention (lined)") %>%
  select(system_id, sys_impervda_ft2, sys_rawstormsizemanaged_in, sys_modelinputcategory) %>%
  na.omit() %>%
  distinct()

#normalze using mean-sd standardization
normalized_cluster <- clustered_Bioreten_lined
normalized_cluster[,2:3] <- scale(clustered_Bioreten_lined[,2:3])

#Elbow method shows a subtle elbow at cluster 4-hence 4 clusters is optimum
#number of clusters
fviz_nbclust(normalized_cluster[,2:3], kmeans, method = "wss")

#k-means 
# Cluster using kmeans with five clusters
cluster_solution <- kmeans(normalized_cluster[,2:3], centers = 2)

# Store the cluster assignments back into the clustering data frame object
clustered_Bioreten_lined$cluster <-factor(cluster_solution$cluster) 

# Look at the distribution of cluster assignments
table(clustered_Bioreten_lined$cluster)

# Group by the cluster assignment and calculate averages
sys_clus_avg <- clustered_Bioreten_lined %>%
    group_by(cluster) %>%
    summarize_if(is.numeric, mean)

# View the resulting table
kable(sys_clus_avg)

#3 D plot
plot_ly(clustered_Bioreten_lined, x = ~sys_impervda_ft2, y = ~sys_rawstormsizemanaged_in) %>%
  add_markers(color = ~cluster)
```

**-Bioretention (unlined) clustering:**

```{r section 14 Bioretention unlined clustering}

clustered_Bioreten_unlined <- table_all %>%
  filter(sys_modelinputcategory == "Bioretention (unlined)") %>%
  select(system_id, sys_impervda_ft2, sys_rawstormsizemanaged_in, loading_ratio, sys_modelinputcategory) %>%
  na.omit() %>%
  distinct()

#normalze using mean-sd standardization
normalized_cluster <- clustered_Bioreten_unlined
normalized_cluster[,2:4] <- scale(clustered_Bioreten_unlined[,2:4])

#Elbow method shows a subtle elbow at cluster 4-hence 4 clusters is optimum
#number of clusters
fviz_nbclust(normalized_cluster[,2:4], kmeans, method = "wss")

#k-means 
# Cluster using kmeans with five clusters
cluster_solution <- kmeans(normalized_cluster[,2:4], centers = 3)

# Store the cluster assignments back into the clustering data frame object
clustered_Bioreten_unlined$cluster <-factor(cluster_solution$cluster) 

# Look at the distribution of cluster assignments
table(clustered_Bioreten_unlined$cluster)

# Group by the cluster assignment and calculate averages
sys_clus_avg <- clustered_Bioreten_unlined %>%
    group_by(cluster) %>%
    summarize_if(is.numeric, mean)

# View the resulting table
kable(sys_clus_avg)

#3 D plot
plot_ly(clustered_Bioreten_unlined, x = ~sys_impervda_ft2, y = ~sys_rawstormsizemanaged_in, z= ~loading_ratio) %>%
  add_markers(color = ~cluster)
```

**-Subsurface infiltration clustering:**

```{r section 15 Subsurface infiltration clustering}

clustered_subsurface <- table_all %>%
  filter(sys_modelinputcategory == "Subsurface infiltration") %>%
  select(system_id, sys_impervda_ft2, sys_rawstormsizemanaged_in, loading_ratio, sys_modelinputcategory) %>%
  na.omit() %>%
  distinct()

#normalze using mean-sd standardization
normalized_cluster <- clustered_subsurface
normalized_cluster[,2:4] <- scale(clustered_subsurface[,2:4])

#Elbow method shows a subtle elbow at cluster 4-hence 4 clusters is optimum
#number of clusters
fviz_nbclust(normalized_cluster[,2:4], kmeans, method = "wss")

#k-means 
# Cluster using kmeans with five clusters
cluster_solution <- kmeans(normalized_cluster[,2:4], centers = 2)

# Store the cluster assignments back into the clustering data frame object
clustered_subsurface$cluster <-factor(cluster_solution$cluster) 

# Look at the distribution of cluster assignments
table(clustered_subsurface$cluster)

# Group by the cluster assignment and calculate averages
sys_clus_avg <- clustered_subsurface %>%
    group_by(cluster) %>%
    summarize_if(is.numeric, mean)

# View the resulting table
kable(sys_clus_avg)

#3 D plot
plot_ly(clustered_subsurface, x = ~sys_impervda_ft2, y = ~sys_rawstormsizemanaged_in, z= ~loading_ratio) %>%
  add_markers(color = ~cluster)
```

**-Subsurface slow release (lined) clustering:**

```{r section 16 Subsurface slow release (lined)	 clustering}

clustered_slowrel_lined <- table_all %>%
  filter(sys_modelinputcategory == "Subsurface slow release (lined)") %>%
  select(system_id, sys_impervda_ft2, sys_rawstormsizemanaged_in, sys_modelinputcategory) %>%
  na.omit() %>%
  distinct()

#normalze using mean-sd standardization
normalized_cluster <- clustered_slowrel_lined
normalized_cluster[,2:3] <- scale(clustered_slowrel_lined[,2:3])

#Elbow method shows a subtle elbow at cluster 4-hence 4 clusters is optimum
#number of clusters
fviz_nbclust(normalized_cluster[,2:3], kmeans, method = "wss")

#k-means 
# Cluster using kmeans with five clusters
cluster_solution <- kmeans(normalized_cluster[,2:3], centers = 2)

# Store the cluster assignments back into the clustering data frame object
clustered_slowrel_lined$cluster <-factor(cluster_solution$cluster) 

# Look at the distribution of cluster assignments
table(clustered_slowrel_lined$cluster)

# Group by the cluster assignment and calculate averages
sys_clus_avg <- clustered_slowrel_lined %>%
    group_by(cluster) %>%
    summarize_if(is.numeric, mean)

# View the resulting table
kable(sys_clus_avg)

#3 D plot
plot_ly(clustered_slowrel_lined, x = ~sys_impervda_ft2, y = ~sys_rawstormsizemanaged_in) %>%
  add_markers(color = ~cluster)


```

**-Subsurface slow release (unlined) clustering:**

```{r section 17 Subsurface slow release (unlined)	}

clustered_slowrel_unlined <- table_all %>%
  filter(sys_modelinputcategory == "Subsurface slow release (unlined)") %>%
  select(system_id, sys_impervda_ft2, sys_rawstormsizemanaged_in, loading_ratio, sys_modelinputcategory) %>%
  na.omit() %>%
  distinct()

#normalze using mean-sd standardization
normalized_cluster <- clustered_slowrel_unlined
normalized_cluster[,2:3] <- scale(clustered_slowrel_unlined[,2:3])

#Elbow method shows a subtle elbow at cluster 4-hence 4 clusters is optimum
#number of clusters
fviz_nbclust(normalized_cluster[,2:3], kmeans, method = "wss")

#k-means 
# Cluster using kmeans with five clusters
cluster_solution <- kmeans(normalized_cluster[,2:3], centers = 2)

# Store the cluster assignments back into the clustering data frame object
clustered_slowrel_unlined$cluster <-factor(cluster_solution$cluster) 

# Look at the distribution of cluster assignments
table(clustered_slowrel_unlined$cluster)

# Group by the cluster assignment and calculate averages
sys_clus_avg <- clustered_slowrel_unlined %>%
    group_by(cluster) %>%
    summarize_if(is.numeric, mean)

# View the resulting table
kable(sys_clus_avg)

#3 D plot
plot_ly(clustered_slowrel_unlined, x = ~sys_impervda_ft2, y = ~sys_rawstormsizemanaged_in, z = ~loading_ratio) %>%
  add_markers(color = ~cluster)
```

Upon clustering the systems, the data will be collected in a unified data frame for stratified sampling. The stratified sampling creates a balanced list of systems that are reflective of the size of each system type and that of the internal clusters. For example, the code below has generated 58 system (\~ 10% of total unmonitored systems); the size of each system type is proportional to that of the population, and within each type of systems, a proportional number of systems has been chosen from each cluster e.g., in Bioinfiltration type systems, 7 systems are randomly chosen from cluster 3, 3 systems from cluster 2, and 2 systems from cluster 1.

```{r section 18 sampling}
#binding all sub-type groups into one table
sys_clustered_all <- bind_rows(clustered_Bioinfiltration[,c("system_id","cluster","sys_modelinputcategory")], clustered_Bioreten_lined[,c("system_id","cluster","sys_modelinputcategory")], clustered_Bioreten_unlined[,c("system_id","cluster","sys_modelinputcategory")], clustered_slowrel_lined[,c("system_id","cluster","sys_modelinputcategory")], clustered_slowrel_unlined[,c("system_id","cluster","sys_modelinputcategory")], clustered_subsurface[,c("system_id","cluster","sys_modelinputcategory")])

#sampling code
stratified_sample <- sys_clustered_all %>%
    group_by(sys_modelinputcategory, cluster) %>%
    sample_frac(size=0.1)

kable(stratified_sample)
```

To generate monitoring systems, we developed a semi-random approach that utilizes a binning technique. This approach involves assigning systems to clusters based on their specific type, with the binning process utilizing a combination of design metrics to group similar systems together. To ensure a representative sample of systems, the random selection process chooses a proportional number of systems from each cluster and system type based on their size.
